The previous section covered the SumTree structure and how to search down to a leaf; this section shows how to plug it into the training process.
First, the Memory class and its hyperparameters:
import numpy as np

class Memory(object):  # transitions are stored as (s, a, r, s_) in a SumTree
    epsilon = 0.01                       # small constant added to every |TD error| so no transition ends up with zero priority
    alpha = 0.6                          # [0, 1]: how strongly the TD error is converted into priority (0 = uniform sampling)
    beta = 0.4                           # importance-sampling exponent, annealed from this initial value up to 1
    beta_increment_per_sampling = 0.001  # how much beta grows per sampling call
    abs_err_upper = 1.                   # clip the absolute TD error at this value

    def __init__(self, capacity):
        self.tree = SumTree(capacity)    # the SumTree built in the previous section
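To get a feel for what these constants do, here is a small standalone numpy sketch (the TD errors are made-up numbers, not produced by the class above): epsilon keeps every priority positive, alpha flattens or sharpens the priority distribution, and beta controls how strongly the importance-sampling weights correct the bias.

import numpy as np

abs_errors = np.array([0.0, 0.5, 2.3])                    # toy |TD errors| for three transitions

# Priority: add epsilon, clip at abs_err_upper, raise to the power alpha.
p = np.power(np.minimum(abs_errors + 0.01, 1.0), 0.6)

# Sampling probability and (max-normalized) importance-sampling weight.
prob = p / p.sum()
ISWeights = np.power(prob / prob.min(), -0.4)             # beta = 0.4

print(p, prob, ISWeights)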
Storing data: every time a transition is stored, we look up the current maximum priority among the leaf nodes and assign it to the new sample (if the tree is still empty and that maximum is 0, we use abs_err_upper instead). This guarantees that every newly added sample gets trained on at least once, instead of being buried because its initial priority is tiny.
    def store(self, transition):
        # Give the new transition the current maximum leaf priority so it is
        # guaranteed to be replayed at least once.
        max_p = np.max(self.tree.tree[-self.tree.capacity:])
        if max_p == 0:
            max_p = self.abs_err_upper
        self.tree.add(max_p, transition)   # set the max priority for the new sample
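A hedged usage sketch for store(): the capacity, the 4-dimensional state, and the action/reward values below are made up for illustration, and the SumTree class from the previous section is assumed to be available.

memory = Memory(capacity=2048)                 # backed by the SumTree from the previous section

s, s_ = np.random.rand(4), np.random.rand(4)   # hypothetical current / next state
a, r = 1, -0.5                                 # hypothetical action and reward

transition = np.hstack((s, [a, r], s_))        # pack (s, a, r, s_) into one flat array
memory.store(transition)                       # the new sample receives the current max priority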
Sampling: pass in the number of transitions you want (n). Because prioritized sampling distorts the original uniformly random data distribution, we also compute importance-sampling weights (ISWeights) that are later used to rescale the loss; see the paper for the details.
    def sample(self, n):
        b_idx = np.empty((n,), dtype=np.int32)
        b_memory = np.empty((n, self.tree.data[0].size))
        ISWeights = np.empty((n, 1))
        pri_seg = self.tree.total_p / n       # split the total priority into n equal segments
        self.beta = np.min([1., self.beta + self.beta_increment_per_sampling])  # anneal beta toward 1
        min_prob = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.total_p  # smallest sampling probability, used to normalize the IS weights
        if min_prob == 0:                     # guard against empty leaves while the buffer is still filling up
            min_prob = 0.00001
        for i in range(n):
            a, b = pri_seg * i, pri_seg * (i + 1)
            v = np.random.uniform(a, b)       # draw one value uniformly inside the i-th segment
            idx, p, data = self.tree.get_leaf(v)
            prob = p / self.tree.total_p
            ISWeights[i, 0] = np.power(prob / min_prob, -self.beta)  # w_i = (P_i / P_min)^(-beta)
            b_idx[i], b_memory[i, :] = idx, data
        return b_idx, b_memory, ISWeights
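The ISWeights returned here are meant to rescale each sample's loss so that the priority-biased sampling does not bias the gradient. A minimal standalone sketch with made-up numbers (the real TD errors would come from your eval/target networks, which this post does not cover):

import numpy as np

ISWeights = np.array([[1.0], [0.8], [0.6], [0.3]])   # pretend output of sample(); rarely drawn samples get weights near 1
td_errors = np.array([[0.1], [0.4], [1.2], [2.0]])   # pretend per-sample TD errors

plain_loss    = np.mean(np.square(td_errors))              # ignores the sampling bias
weighted_loss = np.mean(ISWeights * np.square(td_errors))  # importance-sampling corrected loss
print(plain_loss, weighted_loss)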
Priorities are updated in batches here: the new absolute TD errors are offset by epsilon, clipped at abs_err_upper, and raised to the power alpha before being written back into the tree.
    def batch_update(self, tree_idx, abs_errors):
        abs_errors += self.epsilon                                   # avoid zero priority
        clipped_errors = np.minimum(abs_errors, self.abs_err_upper)  # clip large errors
        ps = np.power(clipped_errors, self.alpha)                    # convert TD error to priority
        for ti, p in zip(tree_idx, ps):
            self.tree.update(ti, p)                                  # propagate the change up the tree
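Putting the three methods together, one learning step would look roughly like the sketch below. The capacity, batch size, toy 4-dimensional states, and the random stand-in TD errors are all assumptions for illustration; a real agent would compute the TD errors from its eval and target networks.

import numpy as np

BATCH_SIZE = 32
memory = Memory(capacity=2048)                       # needs the SumTree from the previous section

# 1. Collect some transitions (toy data in place of real environment interaction).
for _ in range(200):
    s, s_ = np.random.rand(4), np.random.rand(4)
    a, r = np.random.randint(2), np.random.rand()
    memory.store(np.hstack((s, [a, r], s_)))

# 2. Draw a prioritized batch together with its importance-sampling weights.
b_idx, b_memory, ISWeights = memory.sample(BATCH_SIZE)

# 3. Stand-in TD errors; normally these come from q_target - q_eval.
td_errors = np.random.rand(BATCH_SIZE, 1) - 0.5
loss = np.mean(ISWeights * np.square(td_errors))     # IS-weighted loss for the gradient step

# 4. Write the new |TD errors| back so the tree priorities stay current.
memory.batch_update(b_idx, np.abs(td_errors).flatten())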
That wraps up prioritized replay. Storing, sampling, and rescaling the loss are all a bit involved, and managing the sampling matters no less than designing the algorithm itself. OK, that's it for today. Next time we'll look at how to design an environment that follows OpenAI's conventions. See you tomorrow!
Code reference: 莫凡 (Morvan) RL tutorials: https://bre.is/tCA5GuPc